NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Foundation Models for AI-enabled Biological Design

Moldwin, Asher; Shehu, Amarda (May 2025, AAAI 2025 FMs4Bio Workshop)

This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next-steps to improve the quality of biological sequence generation.
more » « less
Full Text Available
BirdieDNA: Reward-Based Pre-Training for Genomic Sequence Modeling

Blouir, Samuel; Circi, Defne; Moldwin, Asher; Shehu, Amarda (April 2024, ICLR MLGenX Workshop)

Transformer-based language models have shown promise in genomics but face challenges unique to DNA, such as sequence lengths spanning hundreds of millions of base pairs and subtle long-range dependencies. Although next-token prediction remains the predominant pre-training objective (inherited from NLP), recent research suggests that multi-objective frameworks can better capture complex structure. In this work, we explore whether the Birdie framework, a reinforcement learning-based, mixture-of-objectives pre-training strategy, can similarly benefit genomic foundation models. We compare a slightly modified Birdie approach with a purely autoregressive, next token prediction baseline on standard Nucleotide Transformer benchmarks. Our results show performance gains in the DNA domain, indicating that mixture-of-objectives training could be a promising alternative to next token prediction only pre-training for genomic sequence modeling.
more » « less
Full Text Available
A More Informative and Reproducible Remote Homology Evaluation for Protein Language Models

Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda (February 2024, AAAI 2024 LLMs4Bio Workshop)

Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three stateof-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.
more » « less
Full Text Available
A More Informative and Reproducible Remote Homology Evaluation for Protein Language Models

Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda (February 2024, LLMs4Bio)

Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three stateof-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.
more » « less
Full Text Available
In the twilight zone of protein sequence homology: do protein language models learn protein structure?

https://doi.org/10.1093/bioadv/vbae119

Kabir, Anowarul; Moldwin, Asher; Bromberg, Yana; Shehu, Amarda (January 2024, Bioinformatics Advances)
Gogovi, Gideon (Ed.)
Abstract MotivationProtein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. ResultsWe address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementationWe believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
more » « less
Full Text Available
A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction

https://doi.org/10.1145/3584371.3612942

Kabir, Anowarul; Moldwin, Asher; Shehu, Amarda (September 2023, ACM)

Full Text Available

Search for: All records